pacman::p_load(jsonlite, tidygraph, ggraph, igraph,
visNetwork, graphlayouts, ggforce,
skimr, tidytext, topicmodels, tidyverse)
Lee Peck Khee
June 5, 2023
June 18, 2023
As illegal, unreported and unregulated (IUU) fishing continues to be a major contributor to overfishing worldwide, with estimated global losses of approximately US$50 billion, FishEye hopes to leverage visual analytics to understand patterns and highlight anomalous groups through the use of a knowledge graph.
For quick reference to the MC3 challenge, Question 1 write-up, please refer to Section 6.
The dataset utilised for the analysis below consists of a total of 27,622 nodes, 24,038 edges and 7,794 connected components. It is an undirected multigraph stored in JSON format.
Possible node types include: {company, person}
Possible node sub-types include: {beneficial owner, company contacts}
Possible edge types include: {person}
Possible edge sub-types include: {beneficial owner, company contacts}
The full details can be found on: https://vast-challenge.github.io/2023/MC3.html
The code chunk below uses p_load() of the pacman package to check if the listed packages are installed on the computer. If they are, they will be loaded into R; otherwise, pacman will install them before loading.
jsonlite: Enables us to import the json file for further analysis
tidygraph: Enables us to manipulate, analyze, and visualize graphs using a consistent and tidy syntax
ggraph: An extension of the ggplot2 package with tools to create visualizations of graphs and networks
visNetwork: Enables us to create interactive network visualizations in R
graphlayouts: Provides various graph layout algorithms such as Fruchterman-Reingold, Kamada-Kawai for graph visualisation
ggforce: Extension of the ggplot2 package by providing additional plotting functions and geoms
skimr: Provides tools for quickly summarizing and visualizing data in a tidy format, and enables one to get a quick overview of the data
tidytext: Enables us to perform various text preprocessing tasks and provides functions for analyzing text data
topicmodels: Provides functions for fitting and analyzing topic models, as well as identifying representative words for each topic
tidyverse: A collection of packages that enables a consistent and tidy data manipulation and analysis workflow in R
We first start off by loading the mc3.json dataset into “mc3_data” using fromJSON() of the jsonlite package below. The resulting output, “mc3_data”, is stored as a large list R object.
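The loading chunk itself is folded in the rendered page; the call likely resembles the commented line below (the data path is an assumption), and a toy JSON string illustrates the nested list that fromJSON() returns.

```r
library(jsonlite)

# Likely shape of the folded chunk; "data/mc3.json" is an assumed path.
# mc3_data <- fromJSON("data/mc3.json")

# fromJSON() parses JSON into nested R lists / data frames:
toy <- fromJSON('{"nodes": [{"id": "A", "type": "company"}],
                 "links": [{"source": "A", "target": "B"}]}')
str(toy)  # a list holding $nodes and $links data frames
```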
The code chunk below is used to extract the links data frame of mc3_data and save it as a tibble data frame called mc3_edges.
Data cleaning is performed by utilising a combination of:
distinct(): to remove duplicate records
mutate() and as.character(): to convert the field data type from list to character
filter(): to remove records where source = target
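The cleaning chunk is folded in the rendered page; a minimal sketch of the steps above, assuming the links element of mc3_data carries source, target and type columns (a toy frame stands in for the real data):

```r
library(dplyr)

# Toy stand-in for mc3_data$links; the real data comes from the JSON file.
links <- data.frame(
  source = c("A", "A", "B"),
  target = c("X", "X", "B"),
  type   = c("Beneficial Owner", "Beneficial Owner", "Company Contacts")
)

mc3_edges <- as_tibble(links) %>%
  distinct() %>%                            # drop duplicate records
  mutate(source = as.character(source),
         target = as.character(target)) %>% # coerce list columns to character
  filter(source != target)                  # drop records where source = target
```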
Upon further inspection, we noticed that there are cells that contain a list of strings within the source column.
# A tibble: 2,169 × 3
source target type
<chr> <chr> <chr>
1 "c(\"Assam Limited Liability Company\", \"Assam Limited Lia… Marcu… Bene…
2 "c(\"Assam Limited Liability Company\", \"Assam Limited Lia… Keith… Bene…
3 "c(\"Assam Limited Liability Company\", \"Assam Limited Lia… Thoma… Bene…
4 "c(\"Assam Limited Liability Company\", \"Assam Limited Lia… Yolan… Bene…
5 "c(\"Assam Limited Liability Company\", \"Assam Limited Lia… Jenni… Bene…
6 "c(\"Assam Limited Liability Company\", \"Assam Limited Lia… Micha… Bene…
7 "c(\"Assam Limited Liability Company\", \"Assam Limited Lia… Saman… Comp…
8 "c(\"Oceanic Explorers Plc Salt spray\", \"The Salted Pearl Inc… Laure… Bene…
9 "c(\"Oceanic Explorers Plc Salt spray\", \"The Salted Pearl Inc… Natal… Bene…
10 "c(\"Oceanic Explorers Plc Salt spray\", \"The Salted Pearl Inc… Ricky… Comp…
# ℹ 2,159 more rows
As such, we utilised mutate() and separate_rows() to unpack these strings. The key use of separate_rows() is to handle data stored in a nested list format: it transforms the data into a “tidy” format, with each value occupying its own row.
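The exact separator used in the folded chunk is not shown; as a hedged illustration, a toy frame with two company names packed into one cell can be tidied as follows:

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature of the problem: two names packed into one source cell.
edges <- tibble(
  source = "Assam Limited Liability Company, Assam Limited",
  target = "Marcus"
)

tidy_edges <- edges %>%
  separate_rows(source, sep = ",\\s*")  # one company per row; target is recycled
```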
The code chunk below is used to extract the nodes data frame of mc3_data and save it as a tibble data frame called mc3_nodes.
Let’s first attempt a simple word count of the word fish. The below code chunk calculates the number of times the word fish appears in the product_services column. We can see that there are several nodes whose product_services are not related to fish.
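The counting chunk is folded in the rendered page; a sketch under the assumption that it uses str_count() of stringr (the column names follow the output below):

```r
library(dplyr)
library(stringr)

# Likely shape of the folded chunk; mc3_nodes is assumed to already exist.
# mc3_nodes <- mc3_nodes %>%
#   mutate(n_fish = str_count(product_services, "fish"))

# str_count() tallies non-overlapping regex matches per string:
n <- str_count("fresh fish and fishing gear", "fish")
n  # 2 (once in "fish", once inside "fishing")
```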
# A tibble: 27,622 × 6
id country type revenue_omu product_services n_fish
<chr> <chr> <chr> <dbl> <chr> <int>
1 Jones LLC ZH Comp… 310612303. Automobiles 0
2 Coleman, Hall and Lopez ZH Comp… 162734684. Passenger cars,… 0
3 Aqua Advancements Sashimi … Oceanus Comp… 115004667. Holding firm wh… 0
4 Makumba Ltd. Liability Co Utopor… Comp… 90986413. Car service, ca… 0
5 Taylor, Taylor and Farrell ZH Comp… 81466667. Fully electric … 0
6 Harmon, Edwards and Bates ZH Comp… 75070435. Discount superm… 0
7 Punjab s Marine conservati… Riodel… Comp… 72167572. Beef, pork, chi… 0
8 Assam Limited Liability … Utopor… Comp… 72162317. Power and Gas s… 0
9 Ianira Starfish Sagl Import Rio Is… Comp… 68832979. Light commercia… 0
10 Moran, Lewis and Jimenez ZH Comp… 65592906. Automobiles, tr… 0
# ℹ 27,612 more rows
The code chunk below leverages unnest_tokens() of tidytext to split the text in the product_services column into individual words. First we have the output column name that will be created as the text is unnested into it (word, in this case), and then the input column that the text comes from (product_services, in this case).
Note that by default, all punctuation is stripped and all tokens are converted to lowercase to enable easy comparison.
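The tokenisation chunk is folded in the rendered page; a minimal sketch on a toy frame (the real input is mc3_nodes, and the output name token_nodes matches its later use):

```r
library(dplyr)
library(tidytext)

# Toy stand-in for mc3_nodes; only the product_services column matters here.
toy_nodes <- tibble(product_services = "Canned Fish, Seafood")

token_nodes <- toy_nodes %>%
  unnest_tokens(word, product_services)  # lowercases and strips punctuation

token_nodes$word
```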
Next, let’s leverage ggplot() to visualise the extracted words via the below code chunk. We can see that there are several stopwords that are not meaningful, for example “of”, “as” and “for”.

token_nodes %>%
count(word, sort = TRUE) %>%
top_n(15) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col(fill = "steelblue") +
xlab(NULL) +
theme_minimal() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5)) +
coord_flip() +
labs(x = "Unique words",
y = "Count",
title = "Count of unique words found in product_services field")

Therefore, we proceed to remove the stopwords using the stop_words dataset from the tidytext package. Note that the anti_join() function from the dplyr package is used to remove all stopwords from the analysis.
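The stopword-removal chunk is folded in the rendered page; a minimal sketch on a toy word list (the real input is the tokenised word column, and the output name stopwords_removed matches its later use):

```r
library(dplyr)
library(tidytext)

data(stop_words)  # tidytext's built-in stopword lexicon

# Toy stand-in for the tokenised data, one word per row.
token_toy <- tibble(word = c("of", "fish", "for", "seafood"))

stopwords_removed <- token_toy %>%
  anti_join(stop_words, by = "word")  # keep only non-stopwords

stopwords_removed$word
```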
We then visualise the extracted words (without stopwords) using the below code chunk. We can see that there are still several words that are not related to our scope of analysis focusing on the fishing industry.

stopwords_removed %>%
count(word, sort = TRUE) %>%
top_n(30) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col(fill = "steelblue") +
xlab(NULL) +
theme_minimal() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5)) +
coord_flip() +
labs(x = "Unique words",
y = "Count",
title = "Count of unique words found in product_services field")

As such, we proceed to remove words such as “character”, “0”, “unknown” etc. This step was done in an iterative manner to capture as many fishery-related keywords as possible. We then visualise the top 30 words that are most closely related to our analysis scope.

stopwords_removed %>%
filter(!word %in% c("character", "0", "unknown", "products",
"services", "food", "related", "equipment",
"accessories", "materials", "including",
"industrial", "meat", "canned", "systems", "freight",
"offers", "machines", "range", "processing",
"steel", "transportation", "supplies", "shoes",
"logistics", "vegetables", "metal", "solutions",
"packaging", "source", "researcher", "freelance",
"footwear", "management", "chemicals", "machinery",
"plastic", "air", "components", "manufacturing",
"tools", "distribution", "water", "foods", "wide",
"oil", "electronic", "fruits", "adhesives",
"apparel", "power", "bags", "care", "service",
"casting", "industry", "household", "oils",
"raw", "cargo", "technology", "specialty",
"aluminum", "home", "items", "grocery", "cooked",
"transport", "storage", "specialises", "smoked",
"rubber", "paper", "fabrics", "electrical", "control",
"activities", "line", "dried", "production", "construction",
"pharmaceutical", "machine", "clothing", "prepared",
"poultry", "canning", "product", "forwarding", "development",
"include", "glue", "furniture", "consumer", "business",
"automotive", "commercial", "fabric", "dry", "chemical",
"warehousing", "die", "customs", "sole", "iron",
"packing", "office", "industries", "applications",
"special", "preparation", "international", "beverages",
"gelatin", "design", "based", "natural", "meats", "custom",
"adhesive","textile", "system", "stationery", "processed",
"leather", "electric", "dairy", "trucking", "personal", "medical",
"hot", "fats", "building", "beef")) %>%
count(word, sort = TRUE) %>%
top_n(30) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col(fill = "steelblue") +
xlab(NULL) +
theme_minimal() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5)) +
coord_flip() +
labs(x = "Unique words",
y = "Count",
title = "Count of unique words found in product_services field")

We also performed topic modelling in an attempt to further bucket keywords related to fishing, to help refine our analysis scope for mc3_nodes subsequently. We begin by creating a document-term matrix and storing it in dtm, before applying LDA topic modelling to obtain 5 key topics. Finally, we extract the top 50 terms associated with each topic.
Note that:
cast_dtm(): converts the count data into a document-term matrix, with each row representing a document and each column representing a term (i.e. a word)
LDA(): function from the topicmodels package to create an LDA model
terms(): extracts the most probable terms associated with each topic from the LDA model
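The topic-modelling chunk is folded in the rendered page; a minimal sketch of the three steps above on toy word counts (the real input is the per-node word counts from the stopword-removed data, and k = 5 with 50 terms in the write-up; a tiny k = 2 / 3-term version is shown here):

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Toy per-document word counts standing in for the real data.
word_counts <- tibble(
  id   = c("doc1", "doc1", "doc2", "doc2"),
  word = c("fish", "salmon", "cars", "tyres"),
  n    = c(3, 2, 4, 1)
)

dtm <- word_counts %>%
  cast_dtm(document = id, term = word, value = n)  # rows = docs, cols = terms

lda_model <- LDA(dtm, k = 2, control = list(seed = 1234))  # fit 2 toy topics
terms(lda_model, 3)                                        # top 3 terms per topic
```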
Therefore, mc3_nodes was then further refined based on the keywords identified in Sections 3.3.2.3 and 3.3.2.4, and saved as mc3_nodes_fish.
mc3_nodes_fish <- mc3_nodes[grepl("\\b(aquatic|clams|cod|crab|crabs|crustaceans|fillet|fillets|fish(es)?|fishing|flounder|fresh|halibut|herring|lobster|marine|octopus|oysters|pacific|pollock|salmon|sea|seafood|seafoods|shellfish|shrimp|shrimps|sockeye|sole|squid|trout|tuna)\\b", mc3_nodes$product_services, ignore.case = TRUE), ]
mc3_nodes_fish
# A tibble: 1,126 × 5
id country type revenue_omu product_services
<chr> <chr> <chr> <dbl> <chr>
1 Hopkins LLC ZH Comp… 14667942. Operates tramp …
2 Caracola del Mar NV Family Rio Is… Comp… 7085566. Canned, frozen …
3 Rollins, Mercado and Miller ZH Comp… 5672004. Pharmaceuticals…
4 Krause Ltd ZH Comp… 4532443. Fresh and cooke…
5 Hawkins-Benson ZH Comp… 2040575. Exports salmon,…
6 Sea Star LLC Shipping Zawali… Comp… 1507514. Operation of fi…
7 Hensley-Martinez ZH Comp… 1288524. Development of …
8 Jammu & Kashmir Sea Sagl Merchan… Puerto… Comp… 1275143. Land operations…
9 Mar de la Vida AG Rio Is… Comp… 1205868. Fish and meat p…
10 Turkish Calamari AB Marine conser… Kondan… Comp… 1167093. Standard, marin…
# ℹ 1,116 more rows
Upon further data investigation, we noticed that there are multiple rows with the same ID. Therefore, we grouped by ID, country and type to obtain the summed revenue_omu for each unique record. product_services was also concatenated and updated accordingly.
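The aggregation chunk is folded in the rendered page; a minimal sketch of the grouping described above on a toy duplicate-ID frame (the real input is mc3_nodes_fish):

```r
library(dplyr)

# Toy frame with a duplicated ID, standing in for mc3_nodes_fish.
dup_nodes <- tibble(
  id               = c("A", "A"),
  country          = c("ZH", "ZH"),
  type             = c("Company", "Company"),
  revenue_omu      = c(100, 50),
  product_services = c("fish", "salmon")
)

deduped <- dup_nodes %>%
  group_by(id, country, type) %>%
  summarise(
    revenue_omu      = sum(revenue_omu, na.rm = TRUE),          # summed revenue
    product_services = paste(product_services, collapse = ", "), # concatenated
    .groups = "drop"
  )
```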
The below code chunk leverages skim() of the skimr package to display summary statistics of the mc3_edges tibble data frame. We can observe that there are no missing values in any of the fields.
| Name | mc3_edges |
| Number of rows | 24937 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1 | 6 | 81 | 0 | 13162 | 0 |
| target | 0 | 1 | 6 | 28 | 0 | 21265 | 0 |
| type | 0 | 1 | 16 | 16 | 0 | 2 | 0 |
The below code chunk leverages datatable() of the DT package to display the mc3_edges tibble data frame as an interactive table in the HTML document.
First, let’s take a look at the mc3_edges type of relationship categorisation. There are two main types, namely “Beneficial Owner” and “Company Contacts”.

ggplot(data = mc3_edges, aes(x = type)) +
geom_bar(fill = "steelblue") +
geom_text(
aes(label = ..count..),
stat = "count",
vjust = -0.5,
size = 3) +
theme_minimal() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(color = "grey", size = 0.5),
plot.title = element_text(hjust = 0.5)) +
labs(x = "Relationship Type", y = "Frequency Count") +
ggtitle("Distribution of Relationship Types within mc3_edges")

Within this section, we calculated the number of companies each beneficial owner owns and saved it as a new column in mc3_edges_bo called bo_target_count. Note that mc3_edges_bo only contains beneficial owner records.
We utilised:
filter(): to filter for beneficial owner and company contact respectively
count(): to count the number of businesses a beneficial owner owns, as well as the number of companies that a company contact has access to, respectively
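The counting chunk is folded in the rendered page; a minimal sketch of the beneficial-owner side on a toy edge list (the real input is mc3_edges; add_count() is one way to attach the per-owner tally as bo_target_count, under the assumption that owners sit in the target column):

```r
library(dplyr)

# Toy stand-in for mc3_edges.
edges_toy <- tibble(
  source = c("Co1", "Co2", "Co2"),
  target = c("Ann", "Ann", "Bob"),
  type   = c("Beneficial Owner", "Beneficial Owner", "Company Contacts")
)

mc3_edges_bo <- edges_toy %>%
  filter(type == "Beneficial Owner") %>%                # keep owner edges only
  add_count(target, name = "bo_target_count")           # companies owned per owner
```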
From the plot below, we can see that most beneficial owners own just 1 company. We also observe that it is rare for beneficial owners to own several companies.

ggplot(mc3_edges_bo, aes(x = bo_target_count)) +
geom_bar(fill = "steelblue") +
geom_text(
aes(label = ..count..),
stat = "count",
vjust = -0.5,
size = 3) +
labs(x = "Number of Companies owned by Beneficial Owners", y = "Frequency Count") +
ggtitle("Distribution on Number of companies owned by Beneficial Owners") +
scale_x_continuous(breaks = seq(min(mc3_edges_bo$bo_target_count), max(mc3_edges_bo$bo_target_count), by = 1)) +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(color = "grey", size = 0.5),
plot.title = element_text(hjust = 0.5))

Within this section, we calculated the number of companies each company contact is linked to and saved it as a new column in mc3_edges_cc called cc_target_count. Likewise, we observe that company contacts are mostly linked to one company, with a minority of company contacts linked to more than 4 companies.
Note that mc3_edges_cc only contains company contacts.

ggplot(mc3_edges_cc, aes(x = cc_target_count)) +
geom_bar(fill = "steelblue") +
geom_text(
aes(label = ..count..),
stat = "count",
vjust = -0.5,
size = 3) +
labs(x = "Number of companies that Company Contacts are linked to", y = "Frequency Count") +
ggtitle("Distribution on Number of companies that Company Contacts are connected") +
scale_x_continuous(breaks = seq(min(mc3_edges_cc$cc_target_count), max(mc3_edges_cc$cc_target_count), by = 1)) +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(color = "grey", size = 0.5),
plot.title = element_text(hjust = 0.5))

From the table below, we can see that only approximately 89% of the data is available within the revenue_omu variable. Hence, we need to exercise caution when using this column due to the missing data.
| Name | mc3_nodes_fish |
| Number of rows | 1116 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 8 | 56 | 0 | 1108 | 0 |
| country | 0 | 1 | 2 | 15 | 0 | 56 | 0 |
| type | 0 | 1 | 7 | 16 | 0 | 3 | 0 |
| product_services | 0 | 1 | 4 | 1139 | 0 | 719 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| revenue_omu | 122 | 0.89 | 6375658 | 35156735 | 4666.67 | 16874.84 | 36448.66 | 85524.67 | 308249623 | ▇▁▁▁▁ |
Similar to the above, we also visualise mc3_nodes_fish via an interactive table on the html document.
From the below plot, we can visualise the mc3_nodes types of relationship categorisation. There are three main types, namely “Beneficial Owner”, “Company” and “Company Contacts”.

ggplot(data = mc3_nodes, aes(x = type)) +
geom_bar(fill = "steelblue") +
geom_text(
aes(label = ..count..),
stat = "count",
vjust = -0.5,
size = 3) +
theme_minimal() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(color = "grey", size = 0.5),
plot.title = element_text(hjust = 0.5)) +
labs(x = "Relationship Type", y = "Frequency Count") +
ggtitle("Distribution of Relationship Types within mc3_nodes")

Note that after cleaning to focus only on fishing-related product_services, we arrive at the below plot. We can observe that most of the remaining nodes are of the type “Company”.

ggplot(data = mc3_nodes_fish, aes(x = type)) +
geom_bar(fill = "steelblue") +
geom_text(
aes(label = ..count..),
stat = "count",
vjust = -0.5,
size = 3) +
theme_minimal() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(color = "grey", size = 0.5),
plot.title = element_text(hjust = 0.5)) +
labs(x = "Relationship Type", y = "Frequency Count") +
ggtitle("Distribution of Relationship Types within mc3_nodes_fish")

Let’s take a closer look at the country analysis within mc3_nodes_fish. Country ZH, followed by Oceanus and Marebak, are the top 3 countries within mc3_nodes_fish.

country_counts <- mc3_nodes_fish %>%
count(country) %>%
top_n(5, n) %>%
arrange(desc(n))
ggplot(data = country_counts, aes(x = reorder(country, n), y = n)) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_text(aes(label = n), vjust = -0.5, size = 3) +
theme_minimal() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(color = "grey", size = 0.5),
plot.title = element_text(hjust = 0.5)
) +
labs(x = "Country", y = "Count", title = "Distribution by Country (Top 5)")

This step is necessary to ensure that mc3_edges_all, mc3_edges_bo and mc3_edges_cc contain only the filtered fishing-related companies. As a recap:
mc3_edges_all contains the overall types (including beneficial owner and company contacts).
mc3_edges_bo contains only beneficial owner type.
mc3_edges_cc contains only company contacts type.
This step is necessary to ensure that the nodes in the mc3_nodes_all/bo/cc include all the source and target values from mc3_edges_all/bo/cc respectively.
id1 <- mc3_edges_all %>%
select(source) %>%
rename(id = source)
id2 <- mc3_edges_all %>%
select(target) %>%
rename(id = target)
mc3_nodes_all <- rbind(id1, id2) %>%
distinct() %>%
left_join(mc3_nodes_fish, by = "id",
unmatched = "drop")
id1_bo <- mc3_edges_bo %>%
select(source) %>%
rename(id = source)
id2_bo <- mc3_edges_bo %>%
select(target) %>%
rename(id = target)
mc3_nodes_bo <- rbind(id1_bo, id2_bo) %>%
distinct() %>%
left_join(mc3_nodes_fish, by = "id",
unmatched = "drop")
id1_cc <- mc3_edges_cc %>%
select(source) %>%
rename(id = source)
id2_cc <- mc3_edges_cc %>%
select(target) %>%
rename(id = target)
mc3_nodes_cc <- rbind(id1_cc, id2_cc) %>%
distinct() %>%
left_join(mc3_nodes_fish, by = "id",
unmatched = "drop")

Within this step, we aim to calculate from the Company’s perspective before storing the results back into mc3_nodes_all as two new columns, namely:
company_contact_count: how many Company Contacts does the Company have?
beneficial_owner_count: how many Beneficial Owners does the Company have?
company_metrics <- mc3_edges_all %>%
group_by(source) %>%
summarise(
company_contact_count = sum(type == "Company Contacts"),
beneficial_owner_count = sum(type == "Beneficial Owner")
) %>%
ungroup()
mc3_nodes_all <- mc3_nodes_all %>%
left_join(company_metrics, by = c("id" = "source")) %>%
mutate(
company_contact_count = ifelse(is.na(company_contact_count), 0, company_contact_count),
beneficial_owner_count = ifelse(is.na(beneficial_owner_count), 0, beneficial_owner_count)
)

Within the step below, we split the revenues into four quantile groups within mc3_nodes_all, mc3_nodes_bo and mc3_nodes_cc. Quantile 1 refers to the lowest end of the revenue scale, while Quantile 4 refers to the highest end. Note that the below steps assume that if revenue_omu is not available, the record is categorised under Quantile 1.
mc3_nodes_all <- mc3_nodes_all %>%
mutate(quantile = ifelse(is.na(revenue_omu), 1, ntile(revenue_omu, 4)))
quantile_counts_all <- mc3_nodes_all %>%
group_by(quantile) %>%
summarise(records = n())
mc3_nodes_bo <- mc3_nodes_bo %>%
mutate(quantile = ifelse(is.na(revenue_omu), 1, ntile(revenue_omu, 4)))
quantile_counts_bo <- mc3_nodes_bo %>%
group_by(quantile) %>%
summarise(records = n())
mc3_nodes_cc <- mc3_nodes_cc %>%
mutate(quantile = ifelse(is.na(revenue_omu), 1, ntile(revenue_omu, 4)))
quantile_counts_cc <- mc3_nodes_cc %>%
group_by(quantile) %>%
summarise(records = n())

Within this section, we build a basic tidygraph data model for mc3_graph_all, mc3_graph_bo and mc3_graph_cc. Thereafter, in Sections 4.2.2 to 4.2.4, we will explore the threshold to use (either top 30%, 20% or 10% based on degree/closeness/betweenness centrality) via a series of static plots.
A higher degree centrality indicates that a node has more direct connections than other nodes.
A higher closeness centrality indicates a shorter average distance to all other nodes. It helps to detect nodes that can spread information very efficiently within a network.
Betweenness centrality measures the extent to which a particular node lies on the paths between other nodes. Nodes with high betweenness can have significant influence within a network.
mc3_graph_all <- tbl_graph(nodes = mc3_nodes_all,
edges = mc3_edges_all,
directed = FALSE) %>%
mutate(betweenness_centrality = centrality_betweenness(),
closeness_centrality = centrality_closeness(),
degree_centrality = centrality_degree())
mc3_graph_bo <- tbl_graph(nodes = mc3_nodes_bo,
edges = mc3_edges_bo,
directed = FALSE) %>%
mutate(betweenness_centrality = centrality_betweenness(),
closeness_centrality = centrality_closeness(),
degree_centrality = centrality_degree())
mc3_graph_cc <- tbl_graph(nodes = mc3_nodes_cc,
edges = mc3_edges_cc,
directed = FALSE) %>%
mutate(betweenness_centrality = centrality_betweenness(),
closeness_centrality = centrality_closeness(),
degree_centrality = centrality_degree())

The code chunk below utilises as_tibble() to convert the three graphs (mc3_graph_all, mc3_graph_bo, mc3_graph_cc) into a tibble format.
Within this section, we leveraged top_frac() to filter the mc3_graph_all dataset to the top 30%, 20% and 10% of nodes by degree_centrality, focusing on company contact count, before determining the appropriate threshold for our subsequent analysis.



set.seed(123)
mc3_graph_all %>%
top_frac(0.30, wt = degree_centrality) %>%
ggraph(layout = "fr") +
geom_edge_link() +
geom_node_point(aes(
size = degree_centrality,
color = company_contact_count,
alpha = 0.1)) +
scale_size_continuous(range=c(3,10))+
scale_color_gradient(low = "gray", high = "red") +
theme_graph()

Within this section, we leveraged top_frac() to filter the mc3_graph_all dataset to the top 30%, 20% and 10% of nodes by degree_centrality, focusing on beneficial owner count, before determining the appropriate threshold for our subsequent analysis.



set.seed(123)
mc3_graph_all %>%
top_frac(0.30, wt = degree_centrality) %>%
ggraph(layout = "fr") +
geom_edge_link() +
geom_node_point(aes(
size = degree_centrality,
color = beneficial_owner_count,
alpha = 0.1)) +
scale_size_continuous(range=c(3,10))+
scale_color_gradient(low = "gray", high = "red") +
theme_graph()

Within this section, we leveraged top_frac() to filter the mc3_graph_all dataset to the top 30%, 20% and 10% of nodes by closeness_centrality, focusing on company contact count, before determining the appropriate threshold for our subsequent analysis.



set.seed(123)
mc3_graph_all %>%
top_frac(0.30, wt = closeness_centrality) %>%
ggraph(layout = "fr") +
geom_edge_link() +
geom_node_point(aes(
size = closeness_centrality,
color = company_contact_count,
alpha = 0.1)) +
scale_size_continuous(range=c(3,10))+
scale_color_gradient(low = "gray", high = "red") +
theme_graph()

Within this section, we leveraged top_frac() to filter the mc3_graph_all dataset to the top 30%, 20% and 10% of nodes by closeness_centrality, focusing on beneficial owner count, before determining the appropriate threshold for our subsequent analysis.



set.seed(123)
mc3_graph_all %>%
top_frac(0.30, wt = closeness_centrality) %>%
ggraph(layout = "fr") +
geom_edge_link() +
geom_node_point(aes(
size = closeness_centrality,
color = beneficial_owner_count,
alpha = 0.1)) +
scale_size_continuous(range=c(3,10))+
scale_color_gradient(low = "gray", high = "red") +
theme_graph()

Within this section, we leveraged top_frac() to filter the mc3_graph_all dataset to the top 30%, 20% and 10% of nodes by betweenness_centrality, focusing on company contact count, before determining the appropriate threshold for our subsequent analysis.



set.seed(123)
mc3_graph_all %>%
top_frac(0.30, wt = betweenness_centrality) %>%
ggraph(layout = "fr") +
geom_edge_link() +
geom_node_point(aes(
size = betweenness_centrality,
color = company_contact_count,
alpha = 0.1)) +
scale_size_continuous(range=c(3,10))+
scale_color_gradient(low = "gray", high = "red") +
theme_graph()

Within this section, we leveraged top_frac() to filter the mc3_graph_all dataset to the top 30%, 20% and 10% of nodes by betweenness_centrality, focusing on beneficial owner count, before determining the appropriate threshold for our subsequent analysis.



set.seed(123)
mc3_graph_all %>%
top_frac(0.30, wt = betweenness_centrality) %>%
ggraph(layout = "fr") +
geom_edge_link() +
geom_node_point(aes(
size = betweenness_centrality,
color = beneficial_owner_count,
alpha = 0.1)) +
scale_size_continuous(range=c(3,10))+
scale_color_gradient(low = "gray", high = "red") +
theme_graph()

Within this section, we leveraged top_frac() to filter the mc3_graph_bo dataset to the top 30%, 20% and 10% of nodes by degree_centrality, focusing on revenue quantile, before determining the appropriate threshold for our subsequent analysis.



set.seed(123)
mc3_graph_bo %>%
top_frac(0.30, wt = degree_centrality) %>%
ggraph(layout = "fr") +
geom_edge_link(aes(width = bo_target_count),
alpha=0.5) +
scale_edge_width(range = c(0.1,5)) +
geom_node_point(aes(
size = degree_centrality,
color = quantile,
alpha = 0.1)) +
scale_size_continuous(range=c(1,10))+
theme_graph()

Within this section, we leveraged top_frac() to filter the mc3_graph_bo dataset to the top 30%, 20% and 10% of nodes by closeness_centrality, focusing on revenue quantile, before determining the appropriate threshold for our subsequent analysis.



set.seed(123)
mc3_graph_bo %>%
top_frac(0.30, wt = closeness_centrality) %>%
ggraph(layout = "fr") +
geom_edge_link(aes(width = bo_target_count),
alpha=0.5) +
scale_edge_width(range = c(0.1,5)) +
geom_node_point(aes(
size = closeness_centrality,
color = quantile,
alpha = 0.1)) +
scale_size_continuous(range=c(1,10))+
theme_graph()

Within this section, we leveraged top_frac() to filter the mc3_graph_bo dataset to the top 30%, 20% and 10% of nodes by betweenness_centrality, focusing on revenue quantile, before determining the appropriate threshold for our subsequent analysis.



set.seed(123)
mc3_graph_bo %>%
top_frac(0.30, wt = betweenness_centrality) %>%
ggraph(layout = "fr") +
geom_edge_link(aes(width = bo_target_count),
alpha=0.5) +
scale_edge_width(range = c(0.1,5)) +
geom_node_point(aes(
size = betweenness_centrality,
color = quantile,
alpha = 0.1)) +
scale_size_continuous(range=c(1,10))+
theme_graph()

Within this section, we leveraged top_frac() to filter the mc3_graph_cc dataset to the top 30%, 20% and 10% of nodes by degree_centrality, focusing on revenue quantile, before determining the appropriate threshold for our subsequent analysis.



set.seed(123)
mc3_graph_cc %>%
top_frac(0.30, wt = degree_centrality) %>%
ggraph(layout = "fr") +
geom_edge_link(aes(width = cc_target_count),
alpha=0.5) +
scale_edge_width(range = c(0.1,5)) +
geom_node_point(aes(
size = degree_centrality,
color = quantile,
alpha = 0.1)) +
scale_size_continuous(range=c(1,10))+
theme_graph()

Within this section, we leveraged top_frac() to filter the mc3_graph_cc dataset to the top 30%, 20% and 10% of nodes by closeness_centrality, focusing on revenue quantile, before determining the appropriate threshold for our subsequent analysis.



set.seed(123)
mc3_graph_cc %>%
top_frac(0.30, wt = closeness_centrality) %>%
ggraph(layout = "fr") +
geom_edge_link(aes(width = cc_target_count),
alpha=0.5) +
scale_edge_width(range = c(0.1,5)) +
geom_node_point(aes(
size = closeness_centrality,
color = quantile,
alpha = 0.1)) +
scale_size_continuous(range=c(1,10))+
theme_graph()

Within this section, we leveraged top_frac() to filter the mc3_graph_cc dataset to the top 30%, 20% and 10% of nodes by betweenness_centrality, focusing on revenue quantile, before determining the appropriate threshold for our subsequent analysis.



set.seed(123)
mc3_graph_cc %>%
top_frac(0.30, wt = betweenness_centrality) %>%
ggraph(layout = "fr") +
geom_edge_link(aes(width = cc_target_count),
alpha=0.5) +
scale_edge_width(range = c(0.1,5)) +
geom_node_point(aes(
size = betweenness_centrality,
color = quantile,
alpha = 0.1)) +
scale_size_continuous(range=c(1,10))+
theme_graph()

Within this section, we compute degree, betweenness and closeness centrality metrics and store them within mc3_nodes_all for easy reference in our subsequent analysis.
closenesscentrality <- closeness(mc3_graph_all, mode = "all")
mc3_nodes_all <- mc3_nodes_all %>%
mutate(closenesscentrality = closenesscentrality)
degreecentrality <- degree(mc3_graph_all, mode = "all")
mc3_nodes_all <- mc3_nodes_all %>%
mutate(degreecentrality = degreecentrality)
betweennesscentrality <- betweenness(mc3_graph_all, directed = FALSE)
mc3_nodes_all <- mc3_nodes_all %>%
mutate(betweennesscentrality = betweennesscentrality)

After reviewing the above plots in Section 4.2, we decided to gain deeper insight by focusing on the top 20% of nodes in terms of degree centrality.
This is because nodes with a higher degree centrality have a high number of interacting neighbours. This might help us identify anomalies within the knowledge graph, as companies involved in illegal fishing are more likely to have more beneficial owners.
According to the article “Fishy networks: Uncovering the companies and individuals behind illegal fishing globally”, unscrupulous operators of vessels involved in IUU fishing take advantage of a lack of regulations by using complex ownership structures to hide the identities of their ultimate beneficial owners (UBOs). Therefore, a company with more beneficial owners (and correspondingly a higher degree centrality) might be worth looking into further.
Likewise, illegal fishing typically involves complex networks amongst various actors such as fishing companies, wholesalers and suppliers. A higher number of company contacts can provide illegal fishing companies with access to valuable resources, markets and revenues to enable their illegal activities. Therefore, it is also worthwhile looking at high degree centrality in relation to company contacts.
The below code chunk defines the thresholds for mc3_graph_all, mc3_graph_bo and mc3_graph_cc as the top 20% of nodes by degree centrality.
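The thresholding chunk is folded in the rendered page; with tidygraph loaded, top_frac() operates on the active nodes of each graph, so the call likely resembles the commented line below (the _degree names follow their later use). A toy tibble demonstrates the filtering behaviour:

```r
library(dplyr)

# Likely shape of the folded chunk (graphs assumed from earlier steps):
# mc3_graph_all_degree <- mc3_graph_all %>%
#   top_frac(0.20, wt = degree_centrality)

# Demonstration on a toy tibble: keep the top 20% of rows by weight.
toy <- tibble(id = letters[1:10], degree_centrality = 1:10)
top20 <- toy %>% top_frac(0.20, wt = degree_centrality)
```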
The code chunk below saves the respective graphs to an RDS file for subsequent usage.
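The saving step itself is folded; a sketch using readr's write_rds() (the file paths are illustrative assumptions, not taken from the original):

```r
library(readr)

# Persist the filtered graphs so later sections can reload them with
# read_rds() instead of recomputing the centrality measures.
write_rds(mc3_graph_all_degree, "data/mc3_graph_all_degree.rds")
write_rds(mc3_graph_bo_degree, "data/mc3_graph_bo_degree.rds")
write_rds(mc3_graph_cc_degree, "data/mc3_graph_cc_degree.rds")
```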
Note that a tidygraph model is stored in R list format. The code chunk below extracts the edges and converts them into a tibble data frame.
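A sketch of the edge extraction, assuming the filtered graphs above (visNetwork expects from and to columns that index rows of the nodes data frame):

```r
library(tidygraph)
library(tibble)

# Activate the edge table of the tidygraph object and convert it to
# a tibble; the from/to columns hold node row indices, which matches
# the id = row_number() assignment used in the nodes data frames below.
edges_df_all <- mc3_graph_all_degree %>%
  activate(edges) %>%
  as_tibble()
```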
The below code chunk serves to prepare a nodes tibble data frame.
nodes_df_all <- mc3_graph_all_degree %>%
activate(nodes) %>%
as_tibble() %>%
rename(label = id) %>%
mutate(id=row_number()) %>%
select(id, label, country, type, product_services, revenue_omu, company_contact_count, beneficial_owner_count, quantile)
nodes_df_bo <- mc3_graph_bo_degree %>%
activate(nodes) %>%
as_tibble() %>%
rename(label = id) %>%
mutate(id=row_number()) %>%
select(id, label, country, type, product_services, revenue_omu, quantile)
nodes_df_cc <- mc3_graph_cc_degree %>%
activate(nodes) %>%
as_tibble() %>%
rename(label = id) %>%
mutate(id=row_number()) %>%
select(id, label, country, type, product_services, revenue_omu, quantile)
Within Section 5, we have:
Section 5.1: Interactive network visualisation on an overall basis, by revenue quantiles and country
Section 5.2: Interactive network visualisation of the beneficial owner relationship with companies, by revenue quantiles and country
Section 5.3: Interactive network visualisation of the company contact relationship with companies, by revenue quantiles and country
Note that the analysis can be done either by country or by revenue quantile, where the objective is to find the nodes with higher degree centrality.
set.seed(123)
id_node <- sort(nodes_df_all$id) # for the id nodes dropdown box
vis_plot_interactive_quantiles_all <- visNetwork(nodes = nodes_df_all, edges = edges_df_all) %>%
visIgraphLayout(layout = "layout_with_fr",
smooth = FALSE,
physics = TRUE
) %>%
visNodes(color = list(highlight = list(border = 'red', background = 'yellow', size = 50))) %>%
visEdges(color = list(highlight = "black"), arrows = 'to',
smooth = list(enabled = TRUE, type = "curvedCW")) %>%
visOptions(selectedBy = "quantile",
highlightNearest = list(enabled = TRUE,
degree = 1,
hover = TRUE,
labelOnly = TRUE),
nodesIdSelection = list(enabled = TRUE,
values = id_node)) %>%
visLegend(width = 0.1)
vis_plot_interactive_quantiles_all
set.seed(123)
id_node <- sort(nodes_df_bo$id) # for the id nodes dropdown box
vis_plot_interactive_quantiles_bo <- visNetwork(nodes = nodes_df_bo, edges = edges_df_bo) %>%
visIgraphLayout(layout = "layout_with_fr",
smooth = FALSE,
physics = TRUE
) %>%
visNodes(color = list(highlight = list(border = 'red', background = 'yellow', size = 50))) %>%
visEdges(color = list(highlight = "black"), arrows = 'to',
smooth = list(enabled = TRUE, type = "curvedCW")) %>%
visOptions(selectedBy = "quantile",
highlightNearest = list(enabled = TRUE,
degree = 1,
hover = TRUE,
labelOnly = TRUE),
nodesIdSelection = list(enabled = TRUE,
values = id_node)) %>%
visLegend(width = 0.1)
vis_plot_interactive_quantiles_bo
set.seed(123)
id_node <- sort(nodes_df_cc$id) # for the id nodes dropdown box
vis_plot_interactive_quantiles_cc <- visNetwork(nodes = nodes_df_cc, edges = edges_df_cc) %>%
visIgraphLayout(layout = "layout_with_fr",
smooth = FALSE,
physics = TRUE
) %>%
visNodes(color = list(highlight = list(border = 'red', background = 'yellow', size = 50))) %>%
visEdges(color = list(highlight = "black"), arrows = 'to',
smooth = list(enabled = TRUE, type = "curvedCW")) %>%
visOptions(selectedBy = "quantile",
highlightNearest = list(enabled = TRUE,
degree = 1,
hover = TRUE,
labelOnly = TRUE),
nodesIdSelection = list(enabled = TRUE,
values = id_node)) %>%
visLegend(width = 0.1)
vis_plot_interactive_quantiles_cc
Question: Use visual analytics to identify anomalies in the business groups present in the knowledge graph.
From the image below, we observe that:
most beneficial owners usually own only one company
most company contacts are usually only linked to one company.

It is useful to deep dive into countries with a high number of companies involved in fishing-related activities (ZH and Oceanus from the plot below). In a study by the Financial Transparency Coalition, fishing vessels flagged to Asian countries (particularly China, which has the world's largest distant-water fleet) accounted for 54.7% of those tagged for IUU fishing (The Guardian, 2022). Hence, focusing on countries with a high number of fishing fleets might help detect IUU fishing.

While our interactive network visualisation focuses on the top 20% by degree centrality, it is also relevant to compute closeness and betweenness centrality.
Higher degree centrality indicates that a node has more direct connections than other nodes.
Higher closeness centrality indicates a shorter average distance to all other nodes.
Higher betweenness centrality indicates that a node lies on more of the shortest paths between other nodes.
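These three measures can be illustrated on a small toy path graph, where the middle node scores highest on both closeness and betweenness even though its degree matches its neighbours':

```r
library(igraph)

# A five-node path a-b-c-d-e built with igraph's formula notation.
g <- make_graph(~ a-b-c-d-e)

degree(g)                        # a, e = 1; b, c, d = 2
closeness(g, mode = "all")       # maximised at the central node "c"
betweenness(g, directed = FALSE) # c = 4, b = d = 3, a = e = 0
```

Node "c" has the same degree as "b" and "d", yet it sits on more shortest paths and is closer on average to everyone, which is exactly why the three measures capture different notions of importance.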
Revenue was divided into four quantiles, on a scale of 1 (lowest revenue) to 4 (highest revenue).
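The quantile column could have been derived with dplyr's ntile(); this is a hedged reconstruction, since the original derivation chunk is not shown in the write-up:

```r
library(dplyr)

# ntile() assigns each company to one of four equal-sized revenue
# groups: 1 = lowest revenue, 4 = highest revenue.
mc3_nodes_all <- mc3_nodes_all %>%
  mutate(quantile = ntile(revenue_omu, 4))
```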
According to the article “Fishy networks: Uncovering the companies and individuals behind illegal fishing globally”, vessel operators involved in IUU fishing take advantage of the lack of regulation by using complex ownership structures to hide the identities of their ultimate beneficial owners. Therefore, a company with more beneficial owners (and correspondingly a higher degree centrality) is worth investigating.
Likewise, illegal fishing typically involves various actors such as fishing companies, wholesalers and suppliers. A higher number of company contacts can provide illegal fishing companies with access to valuable resources and revenues to enable their illegal activities. It is therefore worthwhile to look at company contacts from a high degree centrality perspective.
Leveraging the interactive plots in Section 5, we focused on companies’ connections to beneficial owners and company contacts respectively. Using “Congo Rapids Ltd. Corporation” as an example, we can see that it has connections to several beneficial owners.

However, when we used DT::datatable(mc3_nodes_all) to gain insight into the other centrality measures, the company appears to have relatively low closeness centrality (0.0166667) but high betweenness centrality (1770). This suspicious company is owned by several beneficial owners (possibly to mask the ultimate beneficial owner) and has low direct proximity to other companies (indicating a relatively closed-off network), yet it serves as a significant intermediary between other companies in the broader network. Hence, the company could be acting as a bridge that facilitates the illegal fishing network.

Next, we look at the “By Country” perspective. Zooming into Oceanus, which has the second highest count of fishing-related companies, we detected “Aqua Aura SE Marine life” as having a large number of company contacts. Companies involved in illegal fishing might be more prone to maintaining many company contacts to enable higher revenue streams.

Using DT::datatable(mc3_nodes_all), we found that the company has a presence in two different countries. The first record in the table below is the one picked up via the network graph. Likewise, it has high degree and betweenness centrality but low closeness centrality, making it a suspicious company potentially involved in illegal fishing.

Finally, focusing on “Revenue Quantile - 4” initially pointed our attention to “Zambezi Gorge Incorporated Consulting”. A closer look reveals that Adam Johnson is a rather suspicious node, as he owns several companies (unlike most beneficial owners, who usually own only one company).

Similar to Take-home Exercise 2, the most difficult aspect of working with this graph was the steep learning curve in terms of coding and computing resources. While I had ideas on how to analyse or wrangle the data, it was difficult to put them into action as I am just beginning to become more familiar with R.
Thankfully, Professor Kam extended the submission due date and was always open to taking time out for consultation. He also helped me and the class get up to speed by explaining data wrangling concepts and guiding me in framing my thoughts to complete the assignment.
I learnt a lot from this take-home exercise but as always, I’m sure there is more to learn in this journey.
Given that some companies have already been identified as involved in IUU fishing, it would be useful to extend this analysis to a real-world context. For example, examining the company contacts and beneficial owners of companies already identified as involved in IUU fishing can help scope the analysis towards those contacts and owners, to see which other (potentially suspicious) companies they interact with.
Additionally, it would be useful to establish a systematic process for identifying and grouping similar businesses, leveraging similarity measures to build confidence in the visual groupings.